
[Core][RunPod] Fix list instances in RunPod Provisioner #3878

Merged · 3 commits merged into master from fix-runpod-list-ins on Aug 28, 2024

Conversation

@cblmemo cblmemo commented Aug 26, 2024

Sometimes when the RunPod cluster is still in the process of being created, the `ports` field in the instance runtime is `None`, and iterating over it raises a `TypeError` that breaks operations such as `sky down` (see the traceback below). This PR fixes the issue.

Traceback (most recent call last):
  File "/home/memory/install/miniconda3/envs/sky/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/home/memory/install/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/memory/install/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/memory/skypilot/sky/utils/common_utils.py", line 367, in _record
    return f(*args, **kwargs)
  File "/home/memory/skypilot/sky/cli.py", line 806, in invoke
    return super().invoke(ctx)
  File "/home/memory/install/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/memory/install/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/memory/install/miniconda3/envs/sky/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/memory/skypilot/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
  File "/home/memory/skypilot/sky/cli.py", line 2563, in down
    _down_or_stop_clusters(clusters,
  File "/home/memory/skypilot/sky/cli.py", line 2881, in _down_or_stop_clusters
    subprocess_utils.run_in_parallel(_down_or_stop, clusters)
  File "/home/memory/skypilot/sky/utils/subprocess_utils.py", line 65, in run_in_parallel
    return list(p.imap(func, args))
  File "/home/memory/install/miniconda3/envs/sky/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/home/memory/install/miniconda3/envs/sky/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/memory/skypilot/sky/cli.py", line 2853, in _down_or_stop
    core.down(name, purge=purge)
  File "/home/memory/skypilot/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
  File "/home/memory/skypilot/sky/core.py", line 404, in down
    backend.teardown(handle, terminate=True, purge=purge)
  File "/home/memory/skypilot/sky/utils/common_utils.py", line 388, in _record
    return f(*args, **kwargs)
  File "/home/memory/skypilot/sky/utils/common_utils.py", line 367, in _record
    return f(*args, **kwargs)
  File "/home/memory/skypilot/sky/backends/backend.py", line 116, in teardown
    self._teardown(handle, terminate, purge)
  File "/home/memory/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3480, in _teardown
    self.teardown_no_lock(
  File "/home/memory/skypilot/sky/backends/cloud_vm_ray_backend.py", line 3809, in teardown_no_lock
    provisioner.teardown_cluster(repr(cloud),
  File "/home/memory/skypilot/sky/provision/provisioner.py", line 228, in teardown_cluster
    provision.terminate_instances(cloud_name, cluster_name.name_on_cloud,
  File "/home/memory/skypilot/sky/provision/__init__.py", line 47, in _wrapper
    return impl(*args, **kwargs)
  File "/home/memory/skypilot/sky/provision/runpod/instance.py", line 143, in terminate_instances
    instances = _filter_instances(cluster_name_on_cloud, None)
  File "/home/memory/skypilot/sky/provision/runpod/instance.py", line 22, in _filter_instances
    instances = utils.list_instances()
  File "/home/memory/skypilot/sky/provision/runpod/utils.py", line 83, in list_instances
    for port in instance['runtime']['ports']:
TypeError: 'NoneType' object is not iterable

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh


cblmemo commented Aug 26, 2024

Revert since instance.get('runtime') could be None.

Comment on lines +80 to +84
    # Sometimes when the cluster is in the process of being created,
    # the `port` field in the runtime is None and we need to check for it.
    if (instance['desiredStatus'] == 'RUNNING' and
            instance.get('runtime') and
            instance.get('runtime').get('ports')):
Collaborator
In this case, do we need to retry until the port is available, instead of just skipping it? Otherwise, our provisioning may get stuck?

Collaborator Author

We actually already have such logic in the provision lib:

# Wait for instances to be ready.
while True:
    instances = _filter_instances(cluster_name_on_cloud, ['RUNNING'])
    ready_instance_cnt = 0
    for instance_id, instance in instances.items():
        if instance.get('ssh_port') is not None:
            ready_instance_cnt += 1
    logger.info('Waiting for instances to be ready: '
                f'({ready_instance_cnt}/{config.count}).')
    if ready_instance_cnt == config.count:
        break
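As a standalone sketch of that wait-loop pattern, the following hypothetical `wait_for_ssh_ports` helper polls a listing callable until every instance reports an `ssh_port` (the function name, `timeout`, and `poll_interval` parameters are illustrative additions, not SkyPilot's API):

```python
import time


def wait_for_ssh_ports(list_instances, expected_count,
                       poll_interval=1.0, timeout=60.0):
    """Poll until `expected_count` instances report an `ssh_port`.

    `list_instances` is a callable returning {instance_id: instance_dict},
    mirroring the provision-lib loop quoted above. Raises TimeoutError if
    the instances never become ready, so provisioning cannot hang forever.
    """
    deadline = time.monotonic() + timeout
    while True:
        instances = list_instances()
        ready = sum(1 for inst in instances.values()
                    if inst.get('ssh_port') is not None)
        if ready == expected_count:
            return instances
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f'Only {ready}/{expected_count} instances became ready')
        time.sleep(poll_interval)
```

A fake lister makes the behavior easy to exercise: return `ssh_port: None` on the first call and a real port afterwards, and the helper returns on the second poll.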

@cblmemo cblmemo added this pull request to the merge queue Aug 28, 2024
Merged via the queue into master with commit f73debf Aug 28, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-runpod-list-ins branch August 28, 2024 17:06